Abstract: The PANDA experiment will not use any hardware trigger, i.e. all raw data are streamed
into the data acquisition with a bandwidth of $\le 280$ GB/s. The PANDA Online System is designed
to perform data reduction by a factor of ~800 by reconstruction algorithms programmed in VHDL
(Very High Speed Integrated Circuit Hardware Description Language) on FPGAs (Field Programmable
Gate Arrays).
1. Introduction

The PANDA experiment at the future FAIR (Facility for Antiproton and Ion Research) facility at GSI
Darmstadt, Germany, will investigate $\bar{p}$+p and $\bar{p}$+A collisions. It will be a fixed
target experiment using a frozen hydrogen pellet target and a beam of $\le 10^{11}$ stored and
cooled antiprotons with a beam momentum $p \le 15$ GeV/c. The beam momentum resolution will be
$\delta p/p \ge 10^{-5}$ and the luminosity $\mathcal{L} \le 2\times10^{32}$ cm$^{-2}$ s$^{-1}$.
Among many other topics, the physics program will cover the production of charmonium states in the
reaction $\bar{p}p \to c\bar{c}$. If the beam energy is adjusted to resonant $J/\psi$ production
for one year with a duty factor of 50%, this will correspond to $\le 2\times10^{9}$ $J/\psi$. In
particular, PANDA will be able to measure the width of charmonium states of the order of
$\ge 100$ keV. Other physics topics are spin physics (e.g. measurement of generalized parton
distributions) and hypernuclear physics (e.g. production of double hypernuclei).

PANDA will be one of the very few experiments worldwide not using any hardware trigger.
All raw data will be streamed into the data acquisition (DAQ) and need to be filtered before being
recorded to tape. The reason for this approach is that signal events such as charmonium events in
$\bar{p}p \to c\bar{c}$ have an event topology very similar to that of background events such as
$\bar{p}p \to u\bar{u}, d\bar{d}, s\bar{s}$. There are no straightforward trigger criteria such as
the number of charged tracks or the number of neutral clusters in the calorimeter. Thus, the only
way to reduce the data is online reconstruction on a farm with high computing performance. An
example algorithm is the invariant mass reconstruction of a particular charmonium state, followed
by a cut on the invariant mass signal in the PANDA online system.

2. The PANDA Experiment 

2.1 The PANDA Detector

One of the important tasks to be performed by the PANDA online system will be the online particle
identification (PID), i.e. assigning a probability that a given charged track is a pion, kaon,
proton, electron or muon. For this purpose, the data of the central PANDA Cerenkov detector
DRC (Detector for internally reflected Cerenkov light) play an essential role. It is a detector of
DIRC type, i.e. using internally reflected Cerenkov light, consisting of 16 quartz bars (refractive
index $n = 1.47$) of thickness $d = 1.1$ cm at a radius of $R = 48$ cm. For the central tracking
system, two detector options are still under evaluation, both covering a radial range of
$R$ = 15-41 cm: a TPC (Time Projection Chamber) with 135 padrows and in total 135,169 pads of
$2\times2$ mm$^2$ size, or an STT (Straw Tube Tracker) with 4100 straw tubes with a tube radius
$R = 1$ cm and a tube length $L = 1.5$ m, arranged in 15 double layers. The tubes are arranged
either axially or skewed with respect to the beam axis, the skewed tubes being used for z
reconstruction. As part of the charged particle tracking near the target, an MVD (Micro Vertex
Detector) consisting of ~$10^{7}$ silicon pixels of size $100\times100$ $\mu$m$^2$ and
~$7\times10^{4}$ strips will be implemented. Further technical details about PANDA are described
elsewhere.

2.2 The PANDA Data Acquisition System 

With a high event rate of $\le 2\times10^{7}$ events/s and a raw event size of 4-20 kB (average
14 kB), PANDA will reach a data rate of $\le 280$ GB/s, the same order of magnitude as LHC
experiments. In contrast to those, PANDA will not utilize any hardware triggers; all raw data will
be streamed to the DAQ. The baseline hardware platform for the PANDA DAQ system is the Compute
Node (CN), which will be described in detail in Ch. 4. The CNs will run online reconstruction
algorithms programmed in VHDL on FPGAs for data reduction. All data digitization will be performed
by the frontend electronics in a stage before the CNs. Further details can be found elsewhere.

2.3 The PANDA Offline Computing System 

The PANDA offline computing system is characterized by the large amount of data to be recorded.
The final rate of events written to tape, at a stage behind the online data reduction system, is
designed to be 25 kHz. Assuming one year of data taking with a duty factor of 50%, this corresponds
to $3.78\times10^{11}$ events. With an estimated event size of ~4 kB for DSTs$^1$ (Data Summary
Tapes), this corresponds to $\ge 1.5$ Pbyte per year, or ~378,000 DVDs. Including not only the
DSTs but also raw data, Monte-Carlo simulated data, reconstructed detector hit data etc., the
estimated amount of data to be stored for the first year of PANDA data taking alone will be
~11.5 Pbyte. The offline computing will be performed on ~2000 quad core CPUs for reconstruction,
analysis and MC production.

1 DSTs will be the final reduced data set to be used for physics analyses. They contain e.g. the
4-momenta of charged and neutral particles, but no longer any reconstructed detector hit data.

2.4 The PANDA Online Computing System 

From the constraints of the data acquisition system on the one side and of the offline computing
system on the other side, the requirement for the online computing system can be defined, i.e. to
reduce $2\times10^{7}$ events/s of raw data to $25\times10^{3}$ events/s to be recorded to tape.
This corresponds to a reduction factor of ~800.
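
The rates quoted in Ch. 2.2-2.4 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the numbers given in the text, takes kB and Pbyte as decimal units, and assumes 350 days of operation per year, a value that is not stated in the text but reproduces the quoted $3.78\times10^{11}$ events.

```python
# Inputs quoted in Ch. 2.2-2.4; DAYS_PER_YEAR is an assumption, not from the text.
RAW_EVENT_RATE = 2e7      # events/s streamed into the DAQ (upper limit)
RAW_EVENT_SIZE = 14e3     # bytes, average raw event size
TAPE_EVENT_RATE = 25e3    # events/s written to tape
DST_EVENT_SIZE = 4e3      # bytes per DST event
DUTY_FACTOR = 0.5         # fraction of the year with beam
DAYS_PER_YEAR = 350       # assumption chosen to match the quoted event count


def raw_data_rate():
    """Raw data rate into the DAQ in bytes/s."""
    return RAW_EVENT_RATE * RAW_EVENT_SIZE


def reduction_factor():
    """Factor by which the online system must reduce the event rate."""
    return RAW_EVENT_RATE / TAPE_EVENT_RATE


def events_per_year():
    """Number of events written to tape in one year of data taking."""
    live_seconds = DAYS_PER_YEAR * 24 * 3600 * DUTY_FACTOR
    return TAPE_EVENT_RATE * live_seconds


def dst_bytes_per_year():
    """DST data volume per year in bytes."""
    return events_per_year() * DST_EVENT_SIZE


print(f"raw data rate:     {raw_data_rate() / 1e9:.0f} GB/s")
print(f"reduction factor:  {reduction_factor():.0f}")
print(f"events per year:   {events_per_year():.3g}")
print(f"DST volume / year: {dst_bytes_per_year() / 1e15:.2f} PB")
```

With these inputs the sketch gives 280 GB/s, a reduction factor of 800, and about 1.5 Pbyte of DSTs per year, in agreement with the values in the text.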

3. The HADES Experiment 

First test beams for PANDA are envisaged for 2016. However, the programming of the algorithms in
particular has already started. In order to test online algorithms with real data already now,
data from the HADES experiment were used. HADES studies dielectron events in p+p, p+A and A+A
collisions, e.g. for investigating the behaviour of vector mesons inside nuclear matter. These
vector mesons are detected by their decay into $e^+e^-$. Therefore HADES uses a RICH (Ring Imaging
Cerenkov) detector for $e^+$ and $e^-$ identification. A ring finder is used online in the Level-2
trigger system. Charged tracks are identified in HADES by 4 drift chambers of trapezoidal shape
with ~30 m$^2$ of active area. 2 chambers are located in front of and 2 behind a solenoid field
for momentum measurement. Each drift chamber has 6 layers of wires, arranged at different angles
for assigning a hit position and a track direction in each chamber. The HADES RICH detector has
55,296 readout pads of different geometrical shapes. Signal rings induced by an $e^+$ or an $e^-$
have a fixed ring radius of 4 pads. Further details are described elsewhere [0]. Thus, several of
the algorithms for PANDA (e.g. ring finder and track finder) can be tested (with modifications)
already on real data from HADES. In addition, HADES will be upgraded in the near future, in order
to be prepared for heavy collision systems such as Au+Au collisions with high track multiplicity
and thus higher required data bandwidth. Therefore a new data acquisition system and Level-2
trigger system have been proposed based on the CN, and the algorithms could be part of the
upgraded trigger system.

4. The Compute Node 

The proposed hardware unit to perform the online reconstruction at PANDA is the COMPUTE
NODE (CN), shown in Fig. 1. The 14-layer printed circuit board has been developed by
IHEP Beijing and the II. Physics Department of University Giessen. Each CN has five VIRTEX-4
FX-60 FPGAs (Field Programmable Gate Arrays). These FPGAs were chosen as they combine
high computing performance on the one hand with links for high bandwidth data transfer
(RocketIO) on the other hand. One main feature of the board design is that all FPGAs are connected
point-to-point (see below for details) in order to (a) combine data of different regions of one
detector, processed by different FPGAs, and (b) combine data of different detectors within one
event (i.e. event building). The FPGAs are programmed in VHDL using XILINX ISE (Integrated
Software Environment) Vers. 10.1. As an important note for algorithm design, FPGAs only provide
fixed point arithmetics$^2$. Thus, for any calculations such as matrix multiplications or
trigonometric functions, (a) the parameter range has to be fixed (in order to keep it within a
given fixed precision range), and (b) lookup tables have to be used instead of calculating
arithmetic functions.
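
To illustrate what replacing a trigonometric function by a lookup table means in practice, the following sketch models a sine table in integer arithmetic. The table size of 128 entries and the 16 bit value width are the figures quoted for the PANDA track finder in Ch. 5.3; everything else (one full period per table, the signed Q1.15 number format, and the names `sin_lut`, `cos_lut` and `fixed_mul`) is an assumption made for this example and not taken from the VHDL implementation.

```python
import math

TABLE_SIZE = 128                # number of table entries (value quoted in Ch. 5.3)
FRAC_BITS = 15                  # assumption: signed 16-bit values in Q1.15 format
SCALE = (1 << FRAC_BITS) - 1    # 32767, largest positive signed 16-bit value

# Computed once; on an FPGA this would be a constant table in block RAM.
SINE_TABLE = [round(math.sin(2.0 * math.pi * i / TABLE_SIZE) * SCALE)
              for i in range(TABLE_SIZE)]


def sin_lut(phase_index):
    """Sine as a signed 16-bit integer for an integer phase index.

    The phase is counted in units of 2*pi/TABLE_SIZE and wraps around,
    so the argument range is fixed by construction.
    """
    return SINE_TABLE[phase_index % TABLE_SIZE]


def cos_lut(phase_index):
    """cos(x) = sin(x + pi/2): the same table, shifted by a quarter period."""
    return sin_lut(phase_index + TABLE_SIZE // 4)


def fixed_mul(value, factor_q15):
    """Multiply an integer value by a Q1.15 factor using only integer operations."""
    return (value * factor_q15) >> FRAC_BITS


# Example: x component of a vector of length 1000 at 45 degrees (phase index 16).
x_component = fixed_mul(1000, cos_lut(16))
```

The quantisation error of each table entry is below $2^{-15}$, and the angular granularity of $2\pi/128$ is the price paid for avoiding any run-time evaluation of the sine function.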
Each Virtex-4 FX60 FPGA has two 300 MHz PowerPCs implemented as cores; however, these are
only used for slow control purposes and not for algorithms. In the current design, the PowerPCs
boot Linux 2.6.27. In addition, each FPGA has 2 GB of DDR2 memory attached. The power negotiation
and other slow control tasks between the CN and the ATCA shelf are based upon IPMI (Intelligent
Platform Management Interface), implemented by an ATMEL ATmega2560 microcontroller on a CN add-on
card.

The CN is designed as a board of the ATCA (Advanced Telecommunications Computing Architecture)
standard. The ATCA shelf is shown in Fig. 1. In an ATCA shelf with a full mesh backplane,
point-to-point connections from each CN to each other CN are wired. This avoids any bus
arbitration. In addition to the high computing performance, the CN also provides high bandwidth
interconnections:

(a) All 5 FPGAs are connected pairwise (on the board) by one 32-bit general purpose bus (GPIO)
and one full duplex RocketIO link.

(b) 4 of the 5 FPGAs have two RocketIO links routed to the front panel using Multi-Gigabit
Transceivers (MGT) for optical links.

(c) One of the 5 FPGAs serves as a router and has 16 RocketIO links through the full mesh
backplane to all the other compute nodes in the same ATCA shelf.

(d) All 5 FPGAs have a Gigabit Ethernet link routed to the front panel.

With the current design, the input bandwidth in one ATCA crate is $\le 35$ GB/s (14 CN, eight
optical links each, operating at $\le 2.5$ Gbit/s). The output bandwidth is ~2.6 GB/s (14 CN, five
Gigabit Ethernet links each, operating at a measured TCP performance of 0.3 Gbit/s). All RocketIO
links are currently operated at $\le 2.5$ Gbit/s, but an upgrade to $\le 6.5$ Gbit/s is envisaged,
which would lead to even higher required reduction factors.
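
The crate bandwidths quoted above follow from the link counts and link speeds. A minimal sketch of the arithmetic is given below; it assumes 8 payload bits per byte and ignores protocol and line-coding overheads, and the function name is chosen for this example only.

```python
def aggregate_bandwidth(n_boards, links_per_board, gbit_per_s_per_link):
    """Aggregate bandwidth in GB/s for identical links.

    Assumes 8 payload bits per byte; protocol and line-coding overheads
    are ignored (simplifying assumption).
    """
    return n_boards * links_per_board * gbit_per_s_per_link / 8.0


N_CN = 14  # Compute Nodes per ATCA crate, as quoted in the text

input_now = aggregate_bandwidth(N_CN, 8, 2.5)       # optical links, current speed
input_upgrade = aggregate_bandwidth(N_CN, 8, 6.5)   # optical links, envisaged speed
output_now = aggregate_bandwidth(N_CN, 5, 0.3)      # Ethernet, measured TCP rate

print(f"input  (2.5 Gbit/s links): {input_now:.1f} GB/s")
print(f"input  (6.5 Gbit/s links): {input_upgrade:.1f} GB/s")
print(f"output (0.3 Gbit/s TCP):   {output_now:.3f} GB/s")
print(f"input/output ratio now:    {input_now / output_now:.1f}")
```

The ratio of input to output bandwidth is the reduction a single crate has to achieve just to ship its results out, which is why faster input links increase the required reduction factor.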
Figure 1. Left: Photo of a prototype of the Compute Node (CN). Right: Photo of an ATCA Shelf.



2 There are softcores for floating point calculations available for FPGAs; however, their
performance is not competitive with other architectures such as GPUs (see Ch. 5.6).






5. Algorithms 



As mentioned above, one of the important tasks to be performed by the online reconstruction 
system at PANDA will be the online particle identification (PID). Several subtasks have to be 
accomplished in order to achieve online PID assignment: 

(a) An online ring finder for the DRC. Cerenkov photons propagate and are reflected
inside the quartz bars, then exit the bars at the downstream end and generate rings in the focal
plane. After applying the ring finder, the ring radius $R_{\rm ring}$ is known.

(b) A track finder and a track fitter for charged tracks, with hits in the MVD and STT or TPC.
After the stage of the track fitter, the 3-momentum $\vec{p}_{\rm track}$ and in particular the
magnitude of the momentum $p_{\rm track} = |\vec{p}_{\rm track}|$ of the charged track is known.

(c) The extrapolation of the track onto the surface of the DRC (in order to know at which
position $z_{\rm track}$ the particle entered the quartz bar).

(d) The Cerenkov angle $\theta_{\rm Cerenkov}$ is a function of the two parameters $R_{\rm ring}$
and $z_{\rm track}$, and will be implemented as a lookup table in the online system.

(e) The final PID decision will be based upon a 2-dimensional plot of $\theta_{\rm Cerenkov}$
vs. $p_{\rm track}$.

These algorithm steps will be performed on the farm of CNs, which was described in Ch. 4. In the
following, examples will be given for track finder and ring finder algorithms. These algorithms are
either tested with Monte-Carlo data for PANDA or real data for HADES.
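
The reason why step (e) separates particle species can be made explicit with the Cerenkov relation $\cos\theta = 1/(n\beta)$, where $\beta = p/\sqrt{p^2 + m^2}$. The sketch below is a simplified illustration, not the PANDA online algorithm: it takes the Cerenkov angle and the track momentum as already reconstructed, uses the refractive index $n = 1.47$ from Ch. 2.1 and rounded particle masses, and simply picks the mass hypothesis with the nearest expected angle instead of assigning probabilities. All function names are hypothetical.

```python
import math

N_QUARTZ = 1.47  # refractive index of the quartz bars (Ch. 2.1)

# Particle masses in GeV/c^2 (rounded values).
MASSES = {
    "electron": 0.000511,
    "muon": 0.10566,
    "pion": 0.13957,
    "kaon": 0.49368,
    "proton": 0.93827,
}


def expected_cerenkov_angle(p, mass, n=N_QUARTZ):
    """Expected Cerenkov angle in rad for momentum p in GeV/c.

    Uses cos(theta) = 1 / (n * beta) with beta = p / sqrt(p^2 + m^2).
    Returns None if the particle is below the Cerenkov threshold.
    """
    if p <= 0.0:
        return None
    beta = p / math.sqrt(p * p + mass * mass)
    cos_theta = 1.0 / (n * beta)
    if cos_theta > 1.0:
        return None  # too slow to radiate
    return math.acos(cos_theta)


def closest_hypothesis(theta_measured, p_track):
    """Return the particle type whose expected angle is nearest to the measurement.

    This nearest-angle rule is a stand-in for a real PID decision.
    """
    best, best_diff = None, float("inf")
    for name, mass in MASSES.items():
        theta = expected_cerenkov_angle(p_track, mass)
        if theta is None:
            continue
        diff = abs(theta - theta_measured)
        if diff < best_diff:
            best, best_diff = name, diff
    return best
```

For example, calling `closest_hypothesis(expected_cerenkov_angle(1.0, MASSES["kaon"]), 1.0)` returns `"kaon"`, while a proton of 0.5 GeV/c is below threshold in this material and produces no expected angle at all.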



5.1 Track Finder Algorithm for HADES 

A straight line track finder algorithm was tested with HADES data using the 2 drift chambers
in front of the B field, i.e. $\le 12$ fired wires out of 2110 wires define a track. The algorithm
was fully implemented on an FPGA. The processing time of the FPGA was compared to the CPU time of
a C program performing the same track finder task, but running on a 2.4 GHz Xeon. For different
fired wire multiplicities $N_{\rm wire}$ = 10-400, a speedup by a factor of 10.8-24.3 with respect
to the reference was achieved.



5.2 Ring Finder Algorithm for HADES 

The existing HADES online ring finder system is implemented on a VME board with 12 Xilinx
XC4028EX FPGAs. As such it is part of the HADES Level-2 trigger system and has been in operation
for several years of data taking [10]. For an improved algorithm, to be implemented on the CN for
the HADES upgrade project, the matching of a ring with a track (from the two drift chamber planes
in front of the solenoid field) is foreseen. Rings are only searched for in regions of interest in
the pad plane, given by areas of $13\times13$ pads centered around a pad whose position was found
by track extrapolation. As the RICH uses a mirror, reflecting the Cerenkov light onto a pad plane
in the upstream direction, another coordinate transformation is required, which is implemented as
a lookup table. The pad plane for a typical signal and a typical background event is shown in
Fig. 2. In order to quantitatively compare the old and the new algorithm, the enrichment factor
for lepton candidates in real data is evaluated. The enrichment factor is defined as the ratio of
the efficiency$^3$ and the reduction factor$^4$. For $^{12}$C+$^{12}$C at 1 AGeV, using the new
algorithm, the enrichment increases from 8.9 to 14.6, while the efficiency drops only from 93% to
91%. For $^{40}$Ar+$^{40}$[KCl] at 1.756 AGeV with a higher track density the enrichment increases
only from 1.7 to 2.0, again with a minor efficiency drop from 91% to 90%.
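
The three figures of merit used above can be written down directly from their definitions in the footnotes of this chapter. The sketch below does so; the function names are chosen for this example, and the trigger counts in the usage lines are hypothetical round numbers, not HADES data.

```python
def efficiency(good_pos, false_neg):
    """Good positive triggers divided by (good positive + false negative)."""
    return good_pos / (good_pos + false_neg)


def reduction_factor(good_pos, false_pos, n_downscaled):
    """(Good positive + false positive) triggers divided by downscaled triggers."""
    return (good_pos + false_pos) / n_downscaled


def enrichment(good_pos, false_pos, false_neg, n_downscaled):
    """Ratio of efficiency and reduction factor."""
    return (efficiency(good_pos, false_neg)
            / reduction_factor(good_pos, false_pos, n_downscaled))


# Hypothetical counts for illustration only.
eff = efficiency(good_pos=90, false_neg=10)
red = reduction_factor(good_pos=90, false_pos=410, n_downscaled=10000)
enr = enrichment(good_pos=90, false_pos=410, false_neg=10, n_downscaled=10000)
print(f"efficiency = {eff:.2f}, reduction factor = {red:.3f}, enrichment = {enr:.1f}")
```

With these definitions, an algorithm improves the enrichment either by accepting fewer false positives (a smaller reduction factor) or by losing fewer good events (a higher efficiency), which is the trade-off visible in the numbers quoted for the old and the new ring finder.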




Figure 2. Hit maps for the pad plane of the HADES RICH detector for $^{12}$C+$^{12}$C collisions at
a beam energy of 1 AGeV. Left: Dielectron candidate event with two rings. Right: Background event
with a charged particle crossing the pad plane. Extrapolated tracks for matching with the ring
center are shown as red crosses.



5.3 Track Finder Algorithm for PANDA 

A helix track finder was developed for PANDA [12]. It was tested with Monte-Carlo simulated
data for STT and MVD, i.e. 30 plus $\le 7$ hits per track. A field of $B_z = 2$ T was used, with
field maps correctly treating the overlap with the magnetic dipole field in the PANDA forward
spectrometer. The algorithm is based upon two steps. In the first step, a conformal transformation
is applied. For the $(x, y)$ coordinates of every hit in the STT or MVD, new coordinates
$x' = (x - x_0)/r^2$ and $y' = (y - y_0)/r^2$ with $r^2 = (x - x_0)^2 + (y - y_0)^2$ are
calculated. In a projection onto the xy plane, helix tracks are circles. The conformal map
transforms circles passing through $(x_0, y_0)$ into straight lines, which can be identified more
easily as tracks by a track finder. In the second step, a Hough transform is applied. For any
combination of $(x, y)$ coordinates a straight line is formed, and the polar coordinates $r$ and
$\theta$ are calculated. A normal vector at a 90° angle with respect to the line is constructed.
The parameter $r$ is the distance from $(x=0, y=0)$ along the normal vector to the line, and the
parameter $\theta$ is the polar angle of the normal vector in the xy frame. Then all the new
coordinates are filled into a 2-dimensional $(r, \theta)$ histogram, and a peak finder is applied.
A peak in this histogram corresponds to a found track. Fig. 3 (left) shows the Hough space for 10
tracks of $p = 1$ GeV/c. The algorithm uses fixed point arithmetics with 24 bit precision,
increased to 48 bit in division and multiplication. The size of the Hough space was adjusted to
$512\times512$. The lookup table for the sine function uses 128 values of 16 bit precision.
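
The two steps described above can be sketched in a few lines of software. The following example is a floating point illustration under simplifying assumptions, not the fixed point VHDL implementation: the function names, the histogram range `r_max`, and the choice of the most populated bin as a stand-in for the peak finder are all made up for this sketch. Only the conformal map, the pairwise line construction and the $512\times512$ histogram size follow the text.

```python
import math
from collections import Counter


def conformal_map(hits, x0=0.0, y0=0.0):
    """Map hits (x, y) to (x', y') = ((x - x0) / r^2, (y - y0) / r^2)."""
    mapped = []
    for x, y in hits:
        dx, dy = x - x0, y - y0
        r2 = dx * dx + dy * dy
        if r2 == 0.0:
            continue  # a hit exactly at (x0, y0) has no image
        mapped.append((dx / r2, dy / r2))
    return mapped


def line_to_polar(p1, p2):
    """(r, theta) of the normal from the origin to the line through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    # The normal (-dy, dx) is perpendicular to the line direction (dx, dy).
    theta = math.atan2(x2 - x1, -(y2 - y1))
    r = x1 * math.cos(theta) + y1 * math.sin(theta)
    if r < 0.0:  # keep r >= 0 so that each line has a unique (r, theta)
        r, theta = -r, theta + math.pi
    return r, theta % (2.0 * math.pi)


def hough_peak(points, n_bins=512, r_max=1.0):
    """Fill an (r, theta) histogram from all point pairs and return its peak.

    Returns (r, theta, votes) at the centre of the most populated bin,
    or None if no pair fell inside the histogram range.
    """
    two_pi = 2.0 * math.pi
    votes = Counter()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if points[i] == points[j]:
                continue  # identical points do not define a line
            r, theta = line_to_polar(points[i], points[j])
            if r >= r_max:
                continue
            r_bin = int(r / r_max * n_bins)
            t_bin = int(theta / two_pi * n_bins) % n_bins
            votes[(r_bin, t_bin)] += 1
    if not votes:
        return None
    (r_bin, t_bin), count = votes.most_common(1)[0]
    return ((r_bin + 0.5) * r_max / n_bins,
            (t_bin + 0.5) * two_pi / n_bins,
            count)
```

For hits on a circle of radius $R$ through the origin with centre $(a, b)$, the mapped points lie on the line $2ax' + 2by' = 1$, so the peak is expected at $r = 1/(2R)$ and $\theta = \mathrm{atan2}(b, a)$. Filling the histogram from all pairs costs $O(N^2)$ operations per event, which is one reason why a parallel hardware implementation is attractive.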



Fig. 3 (right) shows the momentum resolution for $p = 1$ GeV/c tracks. As a preliminary result [12],
the efficiency of the online track finder is only ~20% worse compared$^5$ to the offline algorithm.
The $p_T$ resolution is only worse by a factor ~2.5. For an online data reduction these values are
acceptable.



3 The efficiency is defined as the number of good positive triggers, divided by the sum of the numbers of good positive 
and false negative triggers. 

4 The reduction factor is defined as the sum of the numbers of good positive triggers and false positive triggers, 
divided by the number of downscaled triggers. 

5 The comparison between the online and the offline track finder algorithm was performed for events
containing 10 tracks with the same momentum, e.g. $p = 1$ GeV/c, but random variation of the $p_T$.
Figure 3. Left: Hough space for 10 tracks with $p = 1$ GeV/c. For details see text. Right:
Reconstructed momentum for tracks with $p = 1$ GeV/c. $p_T$ and polar angle $\theta$ of the tracks
are varied randomly. The fit function is given by a double Gaussian. The momentum resolution is
$\sigma(p)/p = 2.9$%.



5.4 Event Selector Algorithm for HADES 

In order to test the speed of data movement on the CN, an event selector algorithm was tested with
HADES data. The algorithm was designed for (a) reading HADES binary events from DDR2 memory,
(b) partially decoding the event, (c) issuing an accept or reject decision, and (d) discarding
the event or writing it back to the DDR2 memory, depending on which decision was issued. For a
DMA block size of 32 kB, the algorithm reached a throughput of ~80 MB/s for 100% accepted events
and ~150 MB/s for 25% accepted events.



5.5 Additional Algorithms 

The matching of HADES tracks with the HADES time-of-flight and the HADES electromagnetic
shower system requires track extrapolation through the B field. As a preliminary result,
for $^{40}$Ar+$^{40}$[KCl] at 1.756 AGeV a reduction of ~2 and an enhancement of ~1.8 were achieved
at an efficiency of ~90%. In addition, a track finder based only on hits of a silicon vertex
detector (i.e. 2 layers of a pixel detector and 4 layers of a strip detector) was tested for the
Belle II experiment [12].



5.6 Graphics Processing Units 

As a novel approach for fast data processing, a track fitter based upon a conformal map
transformation within the PandaRoot framework [16] was tested on an NVidia Tesla C1060 Graphics
Adapter [15]. The card has 240 cores and a single precision floating point performance of 933
GFLOPS. For the calculations on the GPU (Graphics Processing Unit), the NVidia CUDA framework
was used and interfaced to PandaRoot. The syntax of CUDA is very similar to the ANSI C programming
language. The track finder for MVD and TPC was run in PandaRoot for tracks with generated
$p = 1$ GeV/c and 50-2000 tracks/event. Then the hit data of the track candidates were transferred
from the host PC to the GPU, where the track fitting was performed in 32 parallel threads in the
next step. The fitted track data were transferred back to the PC. The performance of the complete
algorithm was compared between running with and without the GPU (i.e. host PC alone). A speed-up
of a factor $\le 68$ [15] was achieved. Thus, GPUs seem to be an attractive solution for high
level processing tasks which require floating point operations and are not possible on an FPGA.
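
To give an idea of the kind of floating point work done per track, the sketch below fits a circle through the origin to the hits of one track candidate by mapping them conformally and solving a linear least-squares problem. It is a plain Python illustration under stated assumptions, not the PandaRoot or CUDA code: the function names are hypothetical, the track is assumed to originate at the origin, and the default field of 2 T is the $B_z$ value quoted in Ch. 5.3. Since every track candidate is fitted independently of all others, this type of computation maps naturally onto parallel threads.

```python
import math


def fit_circle_conformal(hits):
    """Fit a circle through the origin to hits [(x, y), ...].

    After the conformal map u = x / r^2, v = y / r^2, the circle
    x^2 + y^2 = 2*a*x + 2*b*y becomes the straight line 2*a*u + 2*b*v = 1,
    which is fitted by linear least squares.
    Returns (a, b, radius) with (a, b) the circle centre.
    """
    suu = suv = svv = su = sv = 0.0
    for x, y in hits:
        r2 = x * x + y * y
        if r2 == 0.0:
            continue  # the origin itself carries no information
        u, v = x / r2, y / r2
        suu += u * u
        suv += u * v
        svv += v * v
        su += u
        sv += v
    # Normal equations for minimising sum((alpha*u + beta*v - 1)^2).
    det = suu * svv - suv * suv
    if det == 0.0:
        raise ValueError("hits do not constrain a circle")
    alpha = (su * svv - sv * suv) / det
    beta = (sv * suu - su * suv) / det
    a, b = alpha / 2.0, beta / 2.0
    return a, b, math.hypot(a, b)


def transverse_momentum(radius_m, b_field_tesla=2.0):
    """p_T in GeV/c of a singly charged track: p_T = 0.2998 * B[T] * R[m]."""
    return 0.299792458 * b_field_tesla * radius_m
```

For hits generated exactly on a circle with centre (3, 4), the fit returns that centre and a radius of 5 up to rounding errors; `transverse_momentum` then converts a radius given in metres into a transverse momentum.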